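The rank of a neural network measures the information flowing across its layers. It is an instance of a key structural condition that applies across broad domains of machine learning. In particular, the assumption of low-rank feature representations has led to algorithmic developments in many architectures. For neural networks, however, the intrinsic mechanism that yields low-rank structures remains unclear. To fill this gap, we perform a rigorous study of the behavior of network rank, focusing particularly on the notion of rank deficiency. Theoretically, we establish a universal monotonic decreasing property of network rank from the basic rules of differential and algebraic composition, and uncover the rank deficiency of network blocks and deep function coupling. With our numerical tools, we provide the first empirical analysis of the per-layer behavior of network rank in practical settings, i.e., ResNets, deep MLPs, and Transformers on ImageNet. These empirical results are in direct accord with our theory. Furthermore, we reveal a novel phenomenon of independence deficit caused by the rank deficiency of deep networks, in which the classification confidence of a given category can be linearly determined by the confidence of a handful of other categories. The theoretical results of this work, together with the empirical findings, may advance the understanding of the inherent principles of deep neural networks.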
translated by Google Translate
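The monotonic decreasing property mentioned in the abstract above rests on a standard linear-algebra fact: the rank of a matrix product cannot exceed the rank of any factor, so the rank of a chain-rule product of layer Jacobians can only stay the same or shrink with depth. The following is a minimal sketch of that fact, not the authors' numerical tool. The matrices `J1`, `J2`, `J3` are hypothetical toy Jacobians chosen for illustration, and `matrix_rank`, `matmul`, and `cumulative_ranks` are helper names introduced here.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a non-zero pivot at or below the current rank row.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cumulative_ranks(jacobians):
    """Rank of the chain-rule product after each layer, input side first."""
    ranks, product = [], None
    for jac in jacobians:
        product = jac if product is None else matmul(jac, product)
        ranks.append(matrix_rank(product))
    return ranks


# Hypothetical per-layer Jacobians (toy values, not taken from the paper).
J1 = [[1, 0, 2], [0, 1, 1], [1, 1, 3]]    # rank 2: row 3 = row 1 + row 2
J2 = [[1, 2, 0], [0, 1, 1], [2, 0, 1]]    # rank 3: determinant is 5
J3 = [[1, -1, 0], [0, 0, 1], [0, 0, 2]]   # rank 2: row 3 = 2 * row 2

ranks = cumulative_ranks([J1, J2, J3])
print(ranks)  # [2, 2, 1]
```

The full-rank layer `J2` cannot restore the rank lost in `J1`, and `J3` reduces it further. This is the non-increasing behavior the abstract describes, shown here only for linear maps.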
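Existing person re-identification (ReID) methods typically load pre-trained ImageNet weights directly for initialization. However, as a fine-grained classification task, ReID is more challenging, and there is a large domain gap between it and ImageNet classification. Inspired by the great success of self-supervised representation learning with contrastive objectives, in this paper we design an unsupervised pre-training framework for ReID based on the contrastive learning (CL) pipeline, dubbed UP-ReID. During pre-training, we attempt to address two critical issues in learning fine-grained ReID features: (1) the augmentations in the CL pipeline may distort the discriminative clues in person images; (2) the fine-grained local features of person images are not fully explored. Therefore, we introduce an intra-identity (I$^2$-) regularization in UP-ReID, which is instantiated as two constraints, one from the global image aspect and one from the local patch aspect: a global consistency is enforced between augmented and original person images to increase robustness to augmentation, while an intrinsic contrastive constraint among the local patches of each image is employed to fully explore the local discriminative clues. Extensive experiments on multiple popular Re-ID datasets, including PersonX, Market1501, CUHK03, and MSMT17, demonstrate that our UP-ReID pre-trained model can significantly benefit downstream ReID fine-tuning and achieve state-of-the-art performance. Code and models will be released at https://github.com/frost-yang-99/up-reid.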
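Existing disentanglement-based methods for generalizable person re-identification aim to disentangle person representations directly into domain-relevant interference and identity-relevant features. However, they ignore that some crucial characteristics are stubbornly entangled in both the domain-relevant interference and the identity-relevant features, and are intractable to decompose in an unsupervised manner. In this paper, we propose a simple yet effective Calibrated Feature Decomposition (CFD) module that focuses on improving the generalization capacity of person re-identification through a more judicious feature decomposition and reinforcement strategy. Specifically, a calibrated-and-standardized batch normalization (CSBN) is designed to learn calibrated person representations by jointly exploring intra-domain calibration and inter-domain standardization of multi-source domain features. CSBN restricts the instance-level inconsistency of the feature distribution for each domain and captures intrinsic domain-level specific statistics. The calibrated person representation is then decomposed into the identity-relevant feature, the domain feature, and the remaining entangled feature. To enhance the generalization ability and ensure high discrimination of the identity-relevant feature, a calibrated instance normalization (CIN) is introduced to enforce discriminative ID-relevant information and filter out ID-irrelevant information, while the rich complementary clues from the remaining entangled feature are further employed to strengthen it. Extensive experiments demonstrate the strong generalization capability of our framework. Our models empowered by CFD modules significantly outperform state-of-the-art domain generalization approaches on multiple widely used benchmarks. Code will be made public: https://github.com/zkcys001/cfd.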
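Cloth-changing person re-identification (CC-ReID) aims to match the same person across different locations over a long duration, e.g., over days, and therefore inevitably faces the challenge of changing clothes. In this paper, we focus on handling the CC-ReID problem under a more challenging setting, i.e., from just a single image, which enables high-efficiency and latency-free pedestrian identification for real-time surveillance applications. Specifically, we introduce gait recognition as an auxiliary task to drive the image ReID model to learn cloth-agnostic representations by leveraging personally unique and cloth-independent gait information. We name this framework GI-ReID. GI-ReID adopts a two-stream architecture that consists of an image ReID-Stream and an auxiliary gait recognition stream (Gait-Stream). The Gait-Stream, which is discarded at inference for high computational efficiency, acts as a regulator that encourages the ReID-Stream to capture cloth-invariant biometric motion features during training. To obtain temporally continuous motion cues from a single image, we design a Gait Sequence Prediction (GSP) module for the Gait-Stream to enrich the gait information. Finally, high-level semantic consistency between the two streams is enforced for effective knowledge regularization. Experiments on multiple image-based cloth-changing ReID benchmarks, e.g., LTCC, PRCC, Real28, and VC-Clothes, demonstrate that GI-ReID performs favorably against state-of-the-art methods. Code is available at https://github.com/jinx-ustc/gi-reid.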
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications benefit only marginally, if at all, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) distilling token relations is more effective than CLS token- and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over MIM pre-training from scratch on ImageNet-1K classification across the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
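To make the idea of distilling token relations concrete, here is a minimal sketch under stated assumptions. It assumes a relation is the row-wise softmax of scaled dot products between token features, and that the student is trained to match the teacher's relation matrix under a KL divergence. The abstract does not specify either choice, and the function names `token_relations` and `relation_kl` are introduced here for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def token_relations(tokens):
    """N x N relation matrix: softmax of scaled token-to-token dot products."""
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([sum(a * b for a, b in zip(t_i, t_j)) / scale for t_j in tokens])
        for t_i in tokens
    ]


def relation_kl(teacher_tokens, student_tokens):
    """Mean KL(teacher row || student row) over all tokens (assumed loss form)."""
    t_rel = token_relations(teacher_tokens)
    s_rel = token_relations(student_tokens)
    total = 0.0
    for t_row, s_row in zip(t_rel, s_rel):
        total += sum(p * math.log(p / q) for p, q in zip(t_row, s_row))
    return total / len(t_rel)


# Toy features: 3 tokens each; teacher width 4, student width 2 (hypothetical).
teacher = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
student = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

loss = relation_kl(teacher, student)
same = relation_kl(teacher, teacher)
```

A relation matrix is always N x N for N tokens, so teacher and student can have different feature widths without an extra projection layer. That is one plausible reason relation targets suit a small student, though the abstract itself does not state it.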
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
Benefiting from its ability to exploit intrinsic supervision information, contrastive learning has recently achieved promising performance in deep graph clustering. However, we observe that two drawbacks of the positive and negative sample construction mechanisms prevent existing algorithms from improving further. 1) The quality of positive samples depends heavily on carefully designed data augmentations, and inappropriate augmentations easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are unreliable because they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) that mines the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in the two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function that pulls together samples from the same cluster while pushing away those from other clusters, by maximizing the cross-view cosine similarity between positive samples and minimizing it between negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with existing state-of-the-art algorithms.
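The objective described above can be sketched roughly as follows. This is an assumed form, not the loss from the paper: positives are all cross-view pairs that share a high-confidence cluster label, negatives are cross-view pairs of centers of different clusters, and the two terms are simply added. The function `cluster_guided_loss` and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def center(vectors):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_guided_loss(view1, view2, labels):
    """Pull same-cluster cross-view pairs together, push cluster centers apart."""
    n = len(labels)
    # Positive term: 1 - cosine for cross-view pairs in the same cluster.
    pos = [
        1.0 - cosine(view1[i], view2[j])
        for i in range(n)
        for j in range(n)
        if labels[i] == labels[j]
    ]
    clusters = sorted(set(labels))
    c1 = {c: center([view1[i] for i in range(n) if labels[i] == c]) for c in clusters}
    c2 = {c: center([view2[i] for i in range(n) if labels[i] == c]) for c in clusters}
    # Negative term: cosine between centers of different clusters across views.
    neg = [cosine(c1[a], c2[b]) for a in clusters for b in clusters if a != b]
    pos_term = sum(pos) / len(pos)
    neg_term = sum(neg) / len(neg) if neg else 0.0
    return pos_term + neg_term


# Toy embeddings of four high-confidence nodes in two views (hypothetical).
view1 = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
view2 = [[0.9, 0.0], [1.0, 0.2], [0.0, 1.1], [0.2, 0.9]]

loss_consistent = cluster_guided_loss(view1, view2, [0, 0, 1, 1])
loss_scrambled = cluster_guided_loss(view1, view2, [0, 1, 0, 1])
```

With labels that match the geometry of the embeddings the loss is small. With scrambled labels it is much larger, which is the signal an encoder would be trained to reduce.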
As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in the demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the conventional evaluation outcome, the environment returns, as coefficients to build an importance map. We also conduct experiments to investigate three major questions: whether frames are equally important, how effective the importance map is, and how importance maps from different IL models are connected. The results show that R2RISE successfully distinguishes the important frames in the demonstrations.
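The mask-retrain-weight loop described above can be sketched as follows. This is an illustrative approximation in the spirit of RISE-style random masking, not the authors' implementation: `train_and_evaluate` is a hypothetical callback standing in for retraining the IL model on the kept frames and returning its environment return, and the normalization by how often each frame was kept is an assumption.

```python
import random


def importance_map(num_frames, num_masks, train_and_evaluate, keep_prob=0.5, seed=0):
    """Average the return of each masked run over the frames that run kept."""
    rng = random.Random(seed)
    weighted = [0.0] * num_frames
    counts = [0] * num_frames
    for _ in range(num_masks):
        # True means the frame stays in the demonstration for this run.
        mask = [rng.random() < keep_prob for _ in range(num_frames)]
        ret = train_and_evaluate(mask)
        for i, kept in enumerate(mask):
            if kept:
                weighted[i] += ret
                counts[i] += 1
    return [w / c if c else 0.0 for w, c in zip(weighted, counts)]


def toy_train_and_evaluate(mask):
    """Hypothetical stand-in: the policy only succeeds if frame 2 is kept."""
    return 1.0 if mask[2] else 0.0


scores = importance_map(num_frames=6, num_masks=200,
                        train_and_evaluate=toy_train_and_evaluate)
most_important = scores.index(max(scores))
```

In this toy setting the frame that the return depends on receives the highest score, while the other frames score around the baseline success rate. In a real experiment each call to the callback is a full retraining run, so the number of masks is the main cost.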
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
The Transformer has achieved impressive success in various computer vision tasks. However, most existing studies require pretraining the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) to achieve satisfactory performance, and such a dataset is usually unavailable for medical images. Additionally, due to the gap between medical and natural images, the improvement brought by ImageNet pretrained weights degrades significantly when the weights are transferred to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning approach designed specifically for medical image classification with a Transformer backbone. BOLT consists of two networks, namely the online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network's representation of the same patch embedding tokens under a different perturbation. To make the most of the Transformer given limited medical data, we propose an auxiliary difficulty ranking task, in which the Transformer is required to identify which branch (i.e., online or target) is processing the more difficult perturbed tokens. Overall, the Transformer endeavours to distill transformation-invariant features from the perturbed tokens, so as to simultaneously measure difficulty and maintain the consistency of the self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks, i.e., skin lesion classification, knee fatigue fracture grading, and diabetic retinopathy grading. The experimental results validate the superiority of BOLT for medical image classification compared with ImageNet pretrained weights and state-of-the-art self-supervised learning approaches.